Instance selection algorithm for big data based on random forest and voting mechanism
ZHOU Xiang, ZHAI Junhai, HUANG Yajie, SHEN Ruicai, HOU Yingzhen
Journal of Computer Applications, 2021, 41(1): 74-80. DOI: 10.11772/j.issn.1001-9081.2020060982
To address the problem of instance selection for big data, an instance selection algorithm based on Random Forest (RF) and a voting mechanism was proposed. Firstly, a big dataset was divided into two subsets: a large first subset and a small or medium second subset. Then, the large subset was divided into q smaller subsets, which were deployed to q cloud computing nodes, and the small or medium subset was broadcast to all q nodes. Next, the local data subset at each node was used to train a random forest, and that random forest was used to select instances from the small or medium subset. The instances selected at the different nodes were merged to obtain the selected-instance subset for that round. This process was repeated p times, yielding p subsets of selected instances. Finally, these p subsets were used for voting to obtain the final selected instance set. The proposed algorithm was implemented on two big data platforms, Hadoop and Spark, and the implementation mechanisms of the two platforms were compared. In addition, the proposed algorithm was compared with the Condensed Nearest Neighbor (CNN) algorithm and the Reduced Nearest Neighbor (RNN) algorithm on 6 large datasets. Experimental results show that, compared with these two algorithms, the proposed algorithm achieves higher test accuracy and lower time consumption as the dataset grows larger. The results indicate that the proposed algorithm has good generalization ability and high operational efficiency in big data processing, and can effectively solve the problem of big data instance selection.
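The partition, merge, and vote steps described in the abstract can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' Hadoop or Spark implementation. The abstract does not state how the random forest decides which instances to keep, nor the voting threshold, so both are left as assumptions: `select_on_node` is a hypothetical callable standing in for "train a forest on the local subset and select from the broadcast subset", and `min_votes` is a hypothetical threshold parameter. All function names here are illustrative.

```python
import random
from collections import Counter
from typing import Callable, Iterable, List, Sequence, Set


def partition(ids: Sequence[int], q: int, rng: random.Random) -> List[List[int]]:
    """Shuffle the large subset and split it into q disjoint local subsets."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return [shuffled[i::q] for i in range(q)]


def vote(rounds: Iterable[Set[int]], min_votes: int) -> Set[int]:
    """Keep instances that were selected in at least min_votes rounds."""
    counts: Counter = Counter()
    for selected in rounds:
        counts.update(set(selected))
    return {i for i, c in counts.items() if c >= min_votes}


def select_instances(
    large_ids: Sequence[int],
    small_ids: Sequence[int],
    q: int,
    p: int,
    select_on_node: Callable[[List[int], Sequence[int]], Iterable[int]],
    min_votes: int,
    seed: int = 0,
) -> Set[int]:
    """Run p rounds of partition -> per-node selection -> merge, then vote.

    select_on_node(local_ids, small_ids) is an assumed placeholder for the
    per-node step: train a model on local_ids and return the ids chosen
    from small_ids. The selection criterion is not given in the abstract.
    """
    rng = random.Random(seed)
    rounds: List[Set[int]] = []
    for _ in range(p):
        merged: Set[int] = set()
        for local in partition(large_ids, q, rng):  # one iteration per node
            merged |= set(select_on_node(local, small_ids))
        rounds.append(merged)  # selected subset for this round
    return vote(rounds, min_votes)


def toy_selector(local_ids: List[int], small_ids: Sequence[int]) -> List[int]:
    """Toy stand-in for the forest-based selector: keeps even ids only."""
    return [i for i in small_ids if i % 2 == 0]


final_set = select_instances(
    large_ids=range(100), small_ids=range(10),
    q=4, p=3, select_on_node=toy_selector, min_votes=2,
)
```

In a real deployment the inner loop over local subsets would run in parallel across the q nodes, and `toy_selector` would be replaced by a selector that actually trains a random forest on each local subset.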